This markdown document lists the steps we followed to refine the models using climate data only.
The data used in the models below are described in the Data Wrangle folder.
Note that the analysis below uses the iNat data with 1510 observations. Amazing!
Below we compare models using canopy symptoms as the response variable.
There are multiple ways to group the response variable, depending on the desired resolution (fineness) of the model.
For now, we move forward with the binary response grouping because it is the broadest and the easiest for the model to classify.
All tree health categories
```
## # A tibble: 12 × 2
## # Groups:   field.tree.canopy.symptoms [12]
##    field.tree.canopy.symptoms                             n
##    <fct>                                              <int>
##  1 Branch Dieback or 'Flagging'                          74
##  2 Browning Canopy                                       43
##  3 Candelabra top or very old spike top (old growth)      2
##  4 Extra Cone Crop                                        4
##  5 Healthy                                              813
##  6 Multiple Symptoms (please list in Notes)              31
##  7 New Dead Top (red or brown needles still attached)    48
##  8 Old Dead Top (needles already gone)                  154
##  9 Other (please describe in Notes)                      16
## 10 Thinning Canopy                                      225
## 11 Tree is dead                                          74
## 12 Yellowing Canopy                                      26
```
We also need to filter the data to include only the response and explanatory variables we’re interested in. For example, whether a sound clip was included in the iNat data is not important.
We also need to remove the other response variables, such as “field.percent.canopy.affected….”, so they are not used as predictors of tree health.
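The filtering described above is not shown in this document, so the following is only a minimal base R sketch of one way it could be done. The data frame `inat` and every column name other than `field.tree.canopy.symptoms` are hypothetical stand-ins; in particular, `field.percent.canopy.affected.example` is an invented name used only to illustrate matching on the `field.percent.canopy.affected` prefix.

```r
# Toy data frame; every column name except field.tree.canopy.symptoms is hypothetical.
inat <- data.frame(
  field.tree.canopy.symptoms = factor(c("Healthy", "Thinning Canopy", "Healthy")),
  field.percent.canopy.affected.example = c(0, 40, 5),  # stand-in for a competing response
  sound.clip.count = c(0, 0, 1),                        # metadata unrelated to tree health
  mean.annual.temp = c(9.8, 10.4, 11.1)                 # stand-in for a climate predictor
)

# Drop columns by exact name (metadata) and by prefix (other response variables).
drop.exact  <- c("sound.clip.count")
drop.prefix <- "^field\\.percent\\.canopy\\.affected"

keep <- !(names(inat) %in% drop.exact) & !grepl(drop.prefix, names(inat))
filtered <- inat[, keep, drop = FALSE]

names(filtered)
```

Matching on a prefix rather than a full name keeps the sketch from depending on the exact spelling of the response columns, which this document abbreviates.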
Note that it might be interesting to know whether the user was an important factor in predicting whether a tree is healthy or unhealthy.
There are also a number of factors that should probably be removed because they may bias the data. For example, the ‘other factor’ question may only be answered for unhealthy trees. We need to think about this a bit more.
Remove variables that have near-zero standard deviations (the entire column is the same value).
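As a rough illustration of this step, the base R sketch below drops numeric columns whose standard deviation falls under a small tolerance. The data frame, its column names, and the `1e-8` cutoff are all assumptions made for the example; the original analysis may have used a different tool or threshold.

```r
# Toy predictors; all column names are hypothetical.
predictors <- data.frame(
  mean.annual.temp = c(9.8, 10.4, 11.1, 10.0),
  constant.flag    = c(1, 1, 1, 1),          # identical in every row
  summer.precip    = c(120, 95, 143, 110)
)

# Standard deviation of each numeric column (NA values ignored).
is.num <- vapply(predictors, is.numeric, logical(1))
col.sd <- vapply(predictors[is.num], sd, numeric(1), na.rm = TRUE)

# Treat anything below a small tolerance as "near zero"; the cutoff is an assumption.
tol <- 1e-8
near.zero <- names(col.sd)[is.na(col.sd) | col.sd < tol]

reduced <- predictors[, !(names(predictors) %in% near.zero), drop = FALSE]
names(reduced)
```

Columns with an undefined standard deviation (for example, all `NA`) are treated the same way as constant columns here, since neither can inform a split.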
Binary tree health categories
```
## # A tibble: 2 × 2
## # Groups:   field.tree.canopy.symptoms [2]
##   field.tree.canopy.symptoms     n
##   <fct>                      <int>
## 1 Healthy                      813
## 2 Unhealthy                    695
```
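The recoding that produced the binary table is not shown, so here is a minimal base R sketch of how the categories might be collapsed, using a small toy vector rather than the real data. Note that the binary table totals 1508 observations, two fewer than the 1510 in the full table, so the original pipeline presumably also dropped two observations in a step this sketch does not reproduce.

```r
# Toy vector using level names from the full category table.
symptoms <- factor(c("Healthy", "Thinning Canopy", "Tree is dead",
                     "Healthy", "Old Dead Top (needles already gone)"))

# Everything that is not "Healthy" becomes "Unhealthy".
binary.symptoms <- factor(
  ifelse(symptoms == "Healthy", "Healthy", "Unhealthy"),
  levels = c("Healthy", "Unhealthy")
)

table(binary.symptoms)
```

Setting `levels` explicitly fixes the order of the two classes, which keeps the rows and columns of any later confusion matrix in a predictable order.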
```
## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 21
## 
##         OOB estimate of  error rate: 30.7%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       586       227   0.2792128
## Unhealthy     236       459   0.3395683
```
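To make the confusion matrix easier to read, the sketch below recomputes the `class.error` column and the OOB error rate from the four counts printed above. It assumes the usual randomForest layout, in which rows are the observed classes and columns are the out-of-bag predictions.

```r
# Confusion matrix copied from the output above (rows = observed, columns = predicted).
cm <- matrix(c(586, 227,
               236, 459),
             nrow = 2, byrow = TRUE,
             dimnames = list(observed  = c("Healthy", "Unhealthy"),
                             predicted = c("Healthy", "Unhealthy")))

# Per-class error: misclassified trees in a row divided by the row total.
class.error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all misclassified trees divided by all trees.
oob.error <- 1 - sum(diag(cm)) / sum(cm)

round(class.error, 7)
round(100 * oob.error, 2)
```

Read this way, the model misclassifies about 28% of healthy trees and about 34% of unhealthy trees, so it is somewhat worse at recognizing unhealthy trees.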
```
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
```
```
## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = monthless.binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 30.97%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       582       231   0.2841328
## Unhealthy     236       459   0.3395683
```
```
## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = normal.monthless.binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 30.57%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       584       229   0.2816728
## Unhealthy     232       463   0.3338129
```
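One way to put the three OOB error rates side by side is sketched below, using only the counts printed in the confusion matrices. The standard error is a rough binomial approximation that treats each out-of-bag prediction as independent, which is an assumption of this sketch and not something the original analysis states.

```r
# Misclassified counts from the three confusion matrices above (off-diagonal sums),
# named after the data set passed to each randomForest call.
n <- 1508
misclassified <- c(binary                  = 227 + 236,
                   monthless.binary        = 231 + 236,
                   normal.monthless.binary = 229 + 232)

oob.error <- misclassified / n

# Rough binomial standard error for each rate; assumes independent OOB predictions.
std.error <- sqrt(oob.error * (1 - oob.error) / n)

# Error rates and standard errors, in percent.
round(100 * cbind(oob.error, std.error), 2)

# Spread between the best and worst model, in percentage points.
round(100 * diff(range(oob.error)), 2)
```

Under this approximation the three rates differ by about 0.4 percentage points while each has a standard error of roughly 1.2 points, so these runs on their own do not clearly separate the three predictor sets.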